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Abstract 

In the past years, RNA sequencing has become the method of choice 
for the study of transcriptome composition. When working with this type 
of data, several tools exist to quantify differences in splicing across con- 
ditions and to address the significance of those changes. However, the 
number of genes predicted to undergo differential splicing is often high, 
and further interpretation of the results becomes a challenging task. Here 
we present SwitchSeq, a novel set of tools designed to help the users in 
the interpretation of differential splicing events that affect protein cod- 
ing genes. More specifically, we provide a framework to identify switch 
events, i.e., cases where, for a given gene, the identity of the most abun- 
dant transcript changes across conditions. The identified events are then 
annotated by incorporating information from several public databases and 
third-party tools, and are further visualised in an intuitive manner with 
the independent R package tviz. All the results are displayed in a self- 
contained HTML document, and are also stored in txt and json format 
to facilitate the integration with any further downstream analysis tools. 
Such analysis approach can be used complementarily to Gene Ontology 
and pathway enrichment analysis, and can also serve as an aid in the 
validation of predicted changes in mRNA and protein abundance. 

Availability: The latest version of SwitchSeq, including installation in- 
structions and use cases, can be found at https : //github . com/mgonzalezporta/ 
SwitchSeq Additionally, the plot capabilities are provided as an indepen- 
dent R package at https://github.com/mgonzalezporta/tviz 

brazmaOebi .ac.uk 
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1 Introduction 



In the past years, RNA sequencing has become the method of choice for the 
study of transcriptome composition. Among the strengths of this technology 
there is the possibility to perform a broad range of analyses without the need 
for intricate experimental designs: e.g., identification of novel transcribed re- 
gions, deconvolution of allele specific expression and study of alternative splic- 
ing, to mention a few. Regarding this last application, several tools exist to 



estimate transcript expression levels (e.g. MISO (Katz et al. 2010), Cufflinks 



(Trapnell et al. 2010), MMSEQ (Turro et al. 2011)) and to predict differences 



in splicing across conditions and assess their significance (e.g. MMDIFF (Turro 



et al. 2013), Cuffdiff (Trapnell et al. 2013), DEXSeq (Anders et al. 2012 j). 



Quite often though, these tools result in a long list of features (e.g., genes) 
that is difficult to interpret. On the other hand, we recently found that gene 
expression is often dominated by one single transcript in protein coding genes 
( Gonzalez-Porta et al. 2013), and introduced the concept of switch event to 



refer to those cases where, for a given gene, the identity of the most abundant 
transcript changes across conditions. With both of these situations in mind, we 
have developed SwitchSeq, the first tool for the identification, annotation and 
visualisation of the most extreme and prevalent changes in splicing that affect 
protein coding genes. 



2 Methods 

2.1 Identification and annotation of switch events in pro- 
tein coding genes with SwitchSeq 

SwitchSeq relies on transcript level quantifications in order to identify switch 
events. This step can be divided into two subtasks: (i) identification of the 
most abundant transcript within each gene, and (ii) discovery of cases where the 
identity of such transcript changes across conditions. SwitchSeq takes as input a 
matrix of normalised counts (TPMs, FPKMs/RPKMs, etc.), thus being entirely 
detached from any specific mapping and quantification strategy (Figure [l]). 
The input table can contain any number of genes and transcripts, but a useful 
approach is to focus on the genes for which the user has already predicted 
differences in splicing through other tools (e.g. MMDIFF, Cuffdiff, DEXSeq). 
In this way, it is possible to obtain a subset of that list that contains the most 
extreme and prevalent changes in splicing across the studied conditions. In 
addition, tools like MMDIFF provide a significance threshold for each of the 
tested transcripts, and such list of significant events can be passed to SwitchSeq 
to filter transcripts accordingly. 

SwitchSeq combines several sources of information to facilitate the interpretation 
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of the identified switch events, including data from public databases and results 
from the execution of third-party tools. This functionality depends on several 
annotation files, which can be easily generated by the user with the provided 
wrapper script (Figure [T]). Specifically, the output of the annotation stage 
contains the following information: (i) the biotype of the transcripts involved 
in the switch, which is retrieved from Ensembl (Flicek et al. 2013) and can be 
informative of their function (e.g. the most common biotypes for transcripts 
derived from protein coding genes are protein coding, nonsense mediated decay, 
retained intron and processed transcript); (ii) whether the identified transcripts 
are classified as principal isoforms in APPRIS (Rodriguez et al. 2013), a re- 
source that aims at selecting a set of representative transcripts for each protein 
coding gene based on conservation data and protein structural and functional 
information; (iii) the expression breadth for each of the identified transcripts, 
i.e. the number of samples in which the transcript is detected as major rel- 
ative to the number of samples in which the gene is expressed (calculated by 
SwitchSeq); and (iv) in the cases in which the switch involves two protein cod- 
ing transcripts, the overlap in their coding sequences (calculated by SwitchSeq 
by performing a global alignment with EMBOSS Needle (Rice et al. 20001), as 



well as the protein sequences themselves for further analysis of potential func- 
tional differences with tools like MAISTAS ( |Floris et al. 2011 ), and links to the 
available structures for the protein, if any, including a graphic representation of 
their coverage (through the UniPDB widget). 



2.2 Visualisation of switch events with tviz 

For each switch event, two types of plots are generated to enable visual inspec- 
tion of the predicted changes (Figure [T]). On the one hand, summary plots 
include combined information on the expression levels across all the samples in 
each condition. More specifically, they include information on the distribution 
of gene expression levels, relative abundances for all the annotated transcripts 
and major transcript dominance {i.e., difference in the abundance of the first 
vs. the second most abundant transcripts of each gene). These data are rep- 
resented in the form of box plots or as the average plus error bars depending 
on the number of samples available. On the other hand, sample-specific visual- 
isation of transcript expression levels is achieved with star plots. In such plots 
each sample is plotted as a pie chart, where transcript abundance is represented 
by the size of each slice, and gene expression is represented by the overall size 
of the plot, thus allowing comparisons across samples. Both types of plots can 
be produced for any gene and independently of SwitchSeq execution with the R 
package tviz. 
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(a) 
INPUT 

expression levels 

(TPMs, RPKMs/FPKMs, etc.) 



gene_id 


transcript_id 


samplel sample2 ... 









annotation 
switchseq -t get_data 



SwitchSeq + tviz 

switchseq -t get_switch 

4- 

OUTPUT 

■ self-contained html (+ txt, json) 

■ high resolution plots 
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ENSG00000124193 

gene expression summary (RPKMs): 
condition 1 (left) - mean-32.88, sd-4.75 
condition 2 (right) - mean-37.49 T sd-5.72 

dominance summary: 
condition 1 (left) - mean-0.39, sd-0.11 
condition 2 (right) - mean-0.53, sd-0.36 

legend: 

nonsense_mediated_decay - ENST00000483871 
protein_coding - ENST00000244020 



Figure 1| Overview of SwitchSeq. (a) The analysis workflow. SwitchSeq 
input consists of a matrix of normalised transcript-level counts, as well as 
several annotation files that can be easily obtained with the provided wrap- 
per tool. This information is used to identify and annotate switch events, 
and the results are stored in a self-contained HTML report, as well as in txt 
and json format. Furthermore, SwitchSeq produces plots for the visualisa- 
tion of the identified events, which can be easily accessed from the report. 
Such ploting capabilities have been implemented in an independent R pack- 
age (tviz), and hence can be used independently of any SwitchSeq execution, 
(b) Identification and annotation of switch events. Based on the expression 
data provided by the user, SwitchSeq identifies the most abundant transcript 
within each gene and detects cases where its identity changes across condi- 
tions, here referred to as switch events. Following the identification stage, 
information from several public databases is incorporated to aid in the inter- 
pretation of the events, (c) Visualisation of switch events. Two types of plots 
are automatically generated by SwitchSeq: summary plots and star plots. 
Summary plots contain information on the distribution of gene expression 
levels, major transcript dominance and transcript relative abundances across 
all the samples that belong to each condition. Star plots provide information 
on the transcript expresison estimates in a sample-specific manner. Example 
plots that show a switch event between the two annotated transcripts of the 
human gene SRSF6 (ENSG00000124193) are included here: the protein cod- 
ing transcript ENST00000244020 is the major isoform in all the three samples 
of the first condition (control - C), whilst the nonsense-mediated-decay tran- 
script ENST00000483871 becomes the most abundant isoform in the second 
one (knock-down - KD). 
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2.3 SwitchSeq output 

The identified switch events and associated information are reported in a self- 
contained HTML document, where the results can be easily filtered, formatted 
and sorted (including ranking to maximise expression breadth in both condi- 
tions). Such HTML also contains additional information on the options used 
during the execution. Finally, the results are also available in txt and json 
format, to facilitate the integration with any further downstream analysis tools. 

3 Conclusion 

Evaluating differences in splicing across conditions has become a common task 
in any RNA-seq data analysis pipeline. However, the number of genes predicted 
to undergo changes in splicing is often high, and further interpretation of the 
results becomes a challenging task. SwitchSeq has been designed to help the 
users in the interpretation of differential splicing events that affect protein cod- 
ing genes, by letting them focus on the most extreme and prevalent changes. 
The identified switch events are further annotated through the combination of 
expression data and information retrieved from several public databases, and 
can be also visualised in an intuitive manner. Such analysis approach can be 
used complement arily to Gene Ontology and pathway enrichment analysis, can 
serve as an aid in the validation of predicted changes in mRNA or protein 
abundance, and can also be applied to investigate the functional impact of the 
detected alterations in splicing across conditions (e.g., healthy vs. tumour sam- 
ples). SwitchSeq has been written in Perl and R and is available as open source 
software, thus enabling further customisation. 
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